feat(cosmos3): add Cosmos3 Super Omni inference tasks #1196

Merged: llmc-reviewer merged 1 commit into main from gsq/dev-cosmos-3-super-omni on Jul 1, 2026
Conversation

@gushiqiao


Summary

Add end-to-end LightX2V inference support for Cosmos3 Super / Cosmos3 Super Omni tasks, with configs and scripts aligned by task name.

Supported Tasks

  • t2i / cosmos3_super_t2i

    • Text-to-image generation from a text prompt.
  • t2v / cosmos3_super_omni_t2v

    • Text-to-video generation from a text prompt.
  • i2v / cosmos3_super_i2v, cosmos3_super_omni_i2v

    • Image-to-video generation conditioned on an input first frame plus prompt.
  • t2av / cosmos3_super_omni_t2av

    • Text-to-audio-video generation from a prompt, producing video with generated audio.
  • i2av / cosmos3_super_omni_i2av

    • Image-to-audio-video generation from a first frame and prompt, producing video with generated audio.
  • i2va forward dynamics / cosmos3_super_omni_action_fd_agibotworld

    • Action-conditioned video rollout from an initial observation image and a provided robot action chunk.
  • i2va multi-chunk forward dynamics / cosmos3_super_omni_action_fd_agibotworld_multichunk

    • Autoregressive multi-segment action rollout: each generated segment feeds its last frame into the next segment, using subsequent action chunks.
  • v2av inverse dynamics / cosmos3_super_omni_action_id_av

    • Inverse dynamics for the autonomous-driving domain: condition on an observed video and predict the corresponding action sequence, saving action output to JSON.
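
The multi-chunk forward-dynamics rollout listed above can be pictured as a plain loop. This is only a sketch under stated assumptions, not the PR's runner code: `rollout_multichunk`, `generate_segment`, and `toy_generate` are hypothetical names, frames are stand-in integers rather than tensors, and whether the real implementation drops or keeps the conditioning frame at each segment boundary is not stated in this PR.

```python
from typing import Callable, List, Sequence, TypeVar

Frame = TypeVar("Frame")
Action = TypeVar("Action")


def rollout_multichunk(
    first_frame: Frame,
    action_chunks: Sequence[Sequence[Action]],
    generate_segment: Callable[[Frame, Sequence[Action]], List[Frame]],
) -> List[Frame]:
    """Autoregressive rollout: one generated segment per action chunk.

    `generate_segment` is a hypothetical stand-in for the model call. It takes
    a conditioning frame plus one action chunk and returns that segment's frames.
    """
    frames: List[Frame] = []
    cond = first_frame
    for chunk in action_chunks:
        segment = generate_segment(cond, chunk)
        if not segment:
            raise ValueError("generate_segment returned no frames")
        frames.extend(segment)
        # The last generated frame conditions the next segment.
        cond = segment[-1]
    return frames


def toy_generate(cond: int, chunk: Sequence[int]) -> List[int]:
    """Toy dynamics: each action is added to the running state."""
    out: List[int] = []
    state = cond
    for action in chunk:
        state += action
        out.append(state)
    return out


print(rollout_multichunk(0, [[1, 1], [2, 2]], toy_generate))  # [1, 2, 4, 6]
```

The second chunk starts from 2, the last frame of the first segment, which is the hand-off the task description refers to.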

Implementation Notes

  • Reuses the Cosmos3 runner/model path instead of importing diffusers model code.
  • Adds action conditioning support for forward dynamics, inverse dynamics, and multi-chunk rollout.
  • Adds audio decoding/muxing support for Omni audio-video tasks.
  • Aligns Cosmos3 config filenames with the script names under scripts/cosmos3.
  • Keeps per-task configs and scripts consistent with existing LightX2V style.

Validation

  • Verified Cosmos3 config JSON files parse successfully.
  • Verified Cosmos3 shell scripts reference existing matching config files.
  • Verified updated Cosmos3 runner Python syntax compiles successfully.
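
For reference, checks of this kind could be scripted roughly as follows. This is a hedged sketch, not the commands actually used for this PR: the function names are hypothetical, and it assumes shell scripts refer to configs by repo-relative paths ending in `.json`.

```python
import json
import re
from pathlib import Path
from typing import List

# Assumption: config references in scripts look like paths ending in ".json".
CONFIG_REF = re.compile(r"[\w./-]+\.json")


def check_json_configs(config_dir: Path) -> List[str]:
    """Return one error string per JSON file that fails to parse."""
    errors = []
    for path in sorted(config_dir.rglob("*.json")):
        try:
            with open(path, "r", encoding="utf-8") as f:
                json.load(f)
        except (OSError, json.JSONDecodeError) as exc:
            errors.append(f"{path}: {exc}")
    return errors


def check_script_config_refs(script_dir: Path, repo_root: Path) -> List[str]:
    """Return one error string per referenced config that does not exist."""
    errors = []
    for script in sorted(script_dir.rglob("*.sh")):
        text = script.read_text(encoding="utf-8")
        for ref in CONFIG_REF.findall(text):
            # Strip a leading "/" left over from prefixes such as ${root}/...
            if not (repo_root / ref.lstrip("/")).is_file():
                errors.append(f"{script}: missing config {ref}")
    return errors


def check_python_syntax(source_dir: Path) -> List[str]:
    """Return one error string per Python file that fails to compile."""
    errors = []
    for path in sorted(source_dir.rglob("*.py")):
        try:
            compile(path.read_text(encoding="utf-8"), str(path), "exec")
        except (SyntaxError, ValueError) as exc:
            errors.append(f"{path}: {exc}")
    return errors
```

Each function returns a list of problems, so an empty list from all three corresponds to the three validation bullets above. The built-in `compile` is used instead of `py_compile` so that no `.pyc` files are written.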

gemini-code-assist (bot) left a comment


Code Review

This pull request introduces support for Cosmos3 models, enabling multi-modal generation capabilities including video, sound, and action outputs. Key changes include the addition of a sound tokenizer, updates to the inference and post-inference modules to handle sound and action segments, and runner/scheduler support for multi-chunk action rollouts and audio muxing. The review feedback highlights several improvement opportunities: optimizing video loading by breaking early when only the first frame is needed, adding UTF-8 encoding when reading prompt files, wrapping the ffmpeg subprocess in a try-except block to prevent pipeline crashes, and safely defaulting action_domain_id to prevent a TypeError when converting to a tensor.


Comment on lines +398 to +402
try:
    for frame in reader:
        frames.append(self._frame_array_to_tensor(np.asarray(frame), height, width))
        if len(frames) >= num_frames:
            break


high

When keep_first is True, we only need the first frame of the video. However, the current loop continues decoding up to num_frames frames from the reader before discarding them. Breaking early when keep_first is True significantly improves performance by avoiding redundant video decoding.

Suggested change
- try:
-     for frame in reader:
-         frames.append(self._frame_array_to_tensor(np.asarray(frame), height, width))
-         if len(frames) >= num_frames:
-             break
+ try:
+     for frame in reader:
+         frames.append(self._frame_array_to_tensor(np.asarray(frame), height, width))
+         if keep_first or len(frames) >= num_frames:
+             break
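
The effect of the early break can be checked with a toy reader that counts how many frames it hands out. This is an illustrative sketch only: `read_frames` and `CountingReader` are hypothetical names, and integers stand in for decoded frames.

```python
from typing import Iterable, Iterator, List


def read_frames(reader: Iterable[int], num_frames: int, keep_first: bool) -> List[int]:
    """Collect frames, stopping after the first one when keep_first is set."""
    frames: List[int] = []
    for frame in reader:
        frames.append(frame)
        if keep_first or len(frames) >= num_frames:
            break
    return frames


class CountingReader:
    """Toy reader that records how many frames were decoded."""

    def __init__(self, total: int) -> None:
        self.total = total
        self.decoded = 0

    def __iter__(self) -> Iterator[int]:
        for index in range(self.total):
            self.decoded += 1  # stands in for the cost of decoding one frame
            yield index


reader = CountingReader(100)
print(read_frames(reader, num_frames=16, keep_first=True), reader.decoded)  # [0] 1
```

Without the `keep_first` check, the same call would decode 16 frames and use only one of them.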


action_domain_ids = None
if action_latents is not None:
    action_domain_id = torch.as_tensor(action_domain_id, device=device, dtype=torch.long).reshape(1)


medium

If action_domain_id is None, calling torch.as_tensor(None) will raise a TypeError. Defaulting it to 0 (representing no_action) prevents this potential crash.

Suggested change
-     action_domain_id = torch.as_tensor(action_domain_id, device=device, dtype=torch.long).reshape(1)
+     action_domain_id = 0 if action_domain_id is None else action_domain_id
+     action_domain_id = torch.as_tensor(action_domain_id, device=device, dtype=torch.long).reshape(1)

Comment on lines +147 to +151
if text.endswith(".json"):
    with open(text, "r") as f:
        return json.dumps(json.load(f))
with open(text, "r") as f:
    return f.read().strip()


medium

When opening files for reading prompt text, it is highly recommended to specify encoding="utf-8" to prevent potential UnicodeDecodeError on systems where the default encoding is not UTF-8 (e.g., Windows).

Suggested change
- if text.endswith(".json"):
-     with open(text, "r") as f:
-         return json.dumps(json.load(f))
- with open(text, "r") as f:
-     return f.read().strip()
+ if text.endswith(".json"):
+     with open(text, "r", encoding="utf-8") as f:
+         return json.dumps(json.load(f))
+ with open(text, "r", encoding="utf-8") as f:
+     return f.read().strip()
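
A self-contained version of this pattern might look like the sketch below. `load_prompt` is a hypothetical name, and the `os.path.isfile` guard is an assumption, since the code surrounding the reviewed hunk is not shown in this PR.

```python
import json
import os


def load_prompt(text: str) -> str:
    """Return the prompt; if `text` names an existing file, read it as UTF-8."""
    if not os.path.isfile(text):
        # Assumption: anything that is not a file path is the prompt itself.
        return text
    with open(text, "r", encoding="utf-8") as f:
        if text.endswith(".json"):
            # Re-serialize so the caller always receives a single JSON string.
            return json.dumps(json.load(f))
        return f.read().strip()
```

With the encoding passed explicitly, a prompt file containing non-ASCII text (for example Chinese prompts) is decoded as UTF-8 regardless of the platform's locale default.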

Comment on lines +630 to +635
result = subprocess.run(cmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
if result.returncode != 0:
    stderr = result.stderr.decode(errors="ignore") if result.stderr else "Unknown error"
    logger.warning(f"Cosmos3 generated audio mux failed, keep silent video. Error: {stderr}")
    return
os.replace(tmp_video_path, video_path)


medium

If ffmpeg is not installed or fails to execute, subprocess.run can raise a FileNotFoundError or other OS exceptions. Wrapping the execution in a try-except block ensures that the entire inference pipeline does not crash at the very end, allowing the silent video to be preserved as intended.

Suggested change
- result = subprocess.run(cmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
- if result.returncode != 0:
-     stderr = result.stderr.decode(errors="ignore") if result.stderr else "Unknown error"
-     logger.warning(f"Cosmos3 generated audio mux failed, keep silent video. Error: {stderr}")
-     return
- os.replace(tmp_video_path, video_path)
+ try:
+     result = subprocess.run(cmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
+     if result.returncode != 0:
+         stderr = result.stderr.decode(errors="ignore") if result.stderr else "Unknown error"
+         logger.warning(f"Cosmos3 generated audio mux failed, keep silent video. Error: {stderr}")
+         return
+     os.replace(tmp_video_path, video_path)
+ except Exception as e:
+     logger.warning(f"Cosmos3 generated audio mux failed with exception, keep silent video. Error: {e}")
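
A narrower variant of the same idea is sketched below for illustration. It is not the merged code: `mux_or_keep_silent` is a hypothetical helper, and it catches `OSError` (the parent class of `FileNotFoundError`) rather than a bare `Exception`, which is a design choice of this sketch and not something the review asked for.

```python
import logging
import os
import subprocess
from typing import Sequence

logger = logging.getLogger(__name__)


def mux_or_keep_silent(cmd: Sequence[str], tmp_video_path: str, video_path: str) -> bool:
    """Run the mux command; on any failure leave video_path untouched.

    Returns True only if the command succeeded and the muxed file replaced
    the original video.
    """
    try:
        result = subprocess.run(cmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    except OSError as exc:
        # Raised, for example, as FileNotFoundError when the binary is missing.
        logger.warning("audio mux could not start, keeping silent video: %s", exc)
        return False
    if result.returncode != 0:
        stderr = result.stderr.decode(errors="ignore") if result.stderr else "Unknown error"
        logger.warning("audio mux failed, keeping silent video: %s", stderr)
        return False
    try:
        os.replace(tmp_video_path, video_path)
    except OSError as exc:
        logger.warning("could not move muxed file, keeping silent video: %s", exc)
        return False
    return True
```

Returning a boolean lets the caller decide whether to report that the output has no audio. Catching only `OSError` keeps programming errors visible, at the cost of not shielding the pipeline from every possible exception.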

llmc-reviewer merged commit 2cbe1f2 into main on Jul 1, 2026
2 checks passed
llmc-reviewer deleted the gsq/dev-cosmos-3-super-omni branch on July 1, 2026 04:41